Abstract. This study investigated the effect of cloze item practice on
reading comprehension, where cloze items were either created by humans,
by machine using natural language processing techniques, or randomly.
Participants from Amazon Mechanical Turk (N = 302) took a pre-test,
read a text, and took part in one of five conditions, Do-Nothing, Re-Read,
Human Cloze, Machine Cloze, or Random Cloze, followed by a 24-hour
retention interval and post-test. Participants used the MoFaCTS system [27],
which in cloze conditions presented items adaptively based on
individual success with each item. Analysis revealed that only Machine
Cloze was significantly higher than the Do-Nothing condition on post-test,
d = .58, CI95 [.21, .94]. Additionally, Machine Cloze was significantly
higher than Human and Random Cloze conditions on post-test, d = .49,
CI95 [.12, .86] and d = .71, CI95 [.34, 1.09] respectively. These results suggest
that Machine Cloze items generated using natural language processing
techniques are effective for enhancing reading comprehension when
delivered by an adaptive practice scheduling system.

1 Introduction


Reading has long been one of the preeminent means of learning new information. 
Reading to learn necessarily involves comprehension, the process by which infor- 
mation in the text is reconciled with prior knowledge. Theorists differ on the 
precise mechanisms underlying the role of prior knowledge in reading compre- 
hension, though there is considerable overlap across theories [19]. The differences 
that exist between theories may be partly attributable to differing ideas about 
how knowledge is represented and applied. Experimental results, however, have 
broadly found that prior knowledge exhibits a strong positive effect on reading 
comprehension [1,3,15]. Prior knowledge also moderates the effect of reading 
ability on comprehension. When prior knowledge is high, the effect of reading 
ability on comprehension vanishes [28]. Prior knowledge also influences whether 
reading ability interacts with text difficulty to influence comprehension [26].
Altogether, the evidence suggests that prior knowledge has a central role, if not
the central role, in reading comprehension. 

If reading to learn requires prior knowledge, but the goal of reading to learn 
is to acquire new knowledge, then it seems there is a kind of circular causal- 
ity between knowledge and reading. In educational practice, this relationship 
becomes apparent when the curricular focus shifts from the mechanics of reading,
i.e. decoding fluency, to content area reading with the emphasis on learning
from text. This shift is often marked by a sudden drop in reading scores,
particularly in students from low income families [5]. Long referred to as the 
“fourth-grade slump,” evidence now suggests that the disparity between learn- 
ing to read and reading to learn starts much earlier but becomes apparent as 
tasks and assessments shift from narrative to informational, content-area reading 
[9,21]. Unfortunately, the fourth-grade slump neither begins in fourth grade, nor 
does it end there. Rather, the evidence suggests that early differences in reading 
skill widen over time. Those with high reading comprehension skill read more 
and become more skilled by practice, a positive-feedback loop [20]. Those with 
low reading comprehension skill read less, and their slowness in decoding delays 
identification of words by sight, which delays vocabulary growth, which in turn 
diminishes comprehension [30]. 

The importance of reading to learn has led to calls for interventions that 
embed comprehension activities in the learning of content areas [23]. The advan- 
tage of targeting comprehension in content areas is that, in addition to teleo- 
logical prior knowledge [28], content areas typically have their own specialized 
vocabulary and style distinct from narrative and informal conversation, making 
normal mechanisms for acquiring vocabulary and grammar, like implicit learn- 
ing, less efficient because of children’s reduced exposure to content-area text 
[7,22]. Vocabulary and comprehension are deeply intertwined because text must 
be decoded, disambiguated, and linked with prior knowledge for comprehension 
to occur [12]. Multiple studies investigating the impact of unknown words on
comprehension suggest that the number of unknown words should be no higher
than 1 in 20 if serious comprehension deficits are to be avoided [13], which is
roughly less than one unknown word per sentence.

Reading comprehension activities in educational contexts typically center 
around the instruction and practice of reading strategies. The definition of strat- 
egy is wide ranging and can include activities that occur before, during, or after 
reading of the text. Moreover, the strategies can be covert, artifact-producing, or 
interactive. For example, of the seven comprehension strategies recommended by 
the National Reading Panel (NRP) [23], comprehension monitoring and ques- 
tion generation are covert and occur during reading, graphic organizers and 
summarization are artifact producing and occur after reading, and cooperative 
learning, question answering, and reciprocal teaching are interactive and occur 
during reading. Arguably, activities that occur after reading, or tasks that are 
interactive, fall more into the realm of instructional activities than comprehen- 
sion strategies. Nevertheless, such activities can be highly effective for increasing 
comprehension of text. One possible explanation for the effectiveness of these
activities is the ICAP Hypothesis [6], which predicts that learning outcomes
will follow the order interactive > constructive > active > passive because of 
the cognitive processes required by interactive, constructive, active, and passive 
activities. Of the NRP comprehension activities, all but monitoring are either 
constructive or interactive in nature, meaning that they require generating out- 
puts or co-generating outputs, respectively. 

Although interactive educational technologies have been developed, most 
notably in dialogue-based intelligent tutoring systems (ITS) [24], these systems 
currently have two weaknesses with respect to reading comprehension. First, 
these systems are primarily content-oriented rather than reading-oriented, mean- 
ing that students using the ITS may not do any particular reading during the 
learning process (though see [14] for a counterexample). Secondly, ITS content 
must be authored manually, and it is commonly believed that it takes several 
hundred hours of authoring effort to create one hour of instruction for an ITS 
using traditional methods [2], though research is beginning to make progress in 
automated authoring [25]. Because of authoring needs and challenges, it is not 
currently possible to automatically create a high-quality, interactive ITS for a 
given piece of text on demand. Accordingly, there are two options for educational 
technology. First, one could focus on interactive strategy training divorced from 
content with the aim of strategy transfer to other texts [18]. This is a worthwhile 
strategy but it does not directly support comprehension of an arbitrary piece of 
text. Secondly, one could step back from interactive activities and instead focus 
on constructive activities, which is the focus of the present work. 

This paper investigates an automated method for generating cloze items and 
the effect of practice with these items on reading comprehension. In a cloze task, 
a participant is asked to restore words that have been deleted from a text. Cloze 
tasks are well established for both vocabulary and comprehension instruction 
in addition to vocabulary and comprehension assessment [7,17,23]. Addition- 
ally, according to the ICAP theory, practice with cloze items is constructive 
because students must generate fill-in-the-blank answers, and constructive activ- 
ities facilitate transfer of learning to novel contexts. In this work our primary 
research questions are therefore (1) whether practice with machine generated 
cloze items promotes reading comprehension, (2) whether reading comprehen- 
sion with machine generated cloze items is equivalent to reading comprehension 
with human generated or random cloze items, and (3) whether reading compre- 
hension supported by machine cloze practice supports transfer. 


2 Method 


2.1 Design 


This study used a between-subjects design with the following conditions: Do- 
Nothing, Re-Read, Human Cloze, Machine Cloze, and Random Cloze conditions. 
All participants took pre-tests and read a text before being assigned to one of the 
conditions. The Do-Nothing condition participants therefore did nothing beyond
the pre-test and reading. The Do-Nothing condition can be considered a
business-as-usual control condition, Re-Read a stronger control condition where
reading time is consistent with practice time in the cloze conditions, and Random
Cloze another control condition where cloze practice occurs but items may not
be optimal. All participants also took a post-test after a 24-hour delay. Test
items with simple declarative answers, or fact questions, were concept-matched 
to test items with contextualized application questions, or transfer questions, 
such that a concept either appeared on the pre-test or on the post-test but not 
both. The purpose of concept matching was to eliminate the possibility that the 
pre-test cued participants on what to study for the post-test. 


2.2 Participants 


Participants were recruited through the Amazon Mechanical Turk (AMT) mar- 
ketplace between September and November of 2016. In this study, participants 
were required to be English speakers from the U.S. or Canada and required to 
have completed at least 50 previous AMT tasks with at least a 95% approval rating.
Experience/approval criteria were applied to prevent automated programs (i.e. “bots”)
from attempting the experiment and to ensure quality from human
participants. Participants were paid $3 for the first phase of the experiment and
$2 for the second phase following the 24-hour retention interval. 

Age of participants in years was 18-25 (11%), 26-34 (45%), 35-54 (36%), 
55-64 (6%), and over 65 (2%), and participants were slightly more female (52%) 
than male (47%). Educational attainment of participants included less than 
high school (<1%), high school (12%), some college (35%), bachelor’s degree 
(43%), and graduate degree (9%). Over 95% of participants reported never hav- 
ing worked in a profession dealing with the circulatory system. 


2.3 Materials


A text on the heart and circulatory system was derived from experimental mate- 
rials used by [33], which used four versions of the text ranging from elementary 
school to medical school difficulty. The text used in the present study was derived 
from elementary school level text, with modifications primarily removing the 
extraneous information present in the original. Examples of removed sentences 
include motivational/interest statements like “You probably think you know 
what the heart looks like. But you may be wrong.”, statements involving reader-oriented
imagery like “You can feel the thumps if you press there with your
hand. You can hear them with your ear.”, and statements that are thematically 
relevant but not directly relevant to the functioning of the heart and circulatory 
system like “When a fire burns, carbon dioxide is formed.” Both fact and trans- 
fer test items were created from the derived text by matching on a particular 
concept. For example, the “heart is a pump” concept has the associated fact question
“Which component(s) of the circulatory system acts as a pump?” and the
associated transfer question “Why doesn’t oxygen rich blood flow directly from 
the lungs to the rest of the body?” A total of 16 concept clusters were created, 
each having one associated fact and transfer question for a total of 32 questions. 
All questions were in multiple-choice format.

Cloze items for the three cloze conditions were created either by a human,
randomly, or by machine using an algorithm described below. Human cloze items 
were created by the same researcher who derived the text and created the pre- 
and post-test items. The researcher selected, at their discretion, the sentences 
capturing the main ideas of the text and the words central to each selected 
sentence’s meaning. The number of sentences (21) and words (53) selected by 
the human were then held constant in the random and machine generated cloze 
item conditions. Accordingly, all cloze conditions contained the same number 
of items, the items in each condition were generated from 21 sentences and 53 
words within those sentences, but each condition differed in terms of which 21 
sentences and 53 words were selected. 

Random cloze items were created by randomly selecting 21 sentences from 
all sentences in the text and randomly selecting between one and four words in 
each sentence such that the words were longer than two characters, words did 
not include “the” or “and,” and 53 words were selected in total. The random 
cloze generation procedure was repeated six times to create six sets of random 
cloze items, to minimize the chance the effects from this condition were due to 
an unusual random sample. 
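
The random selection constraints above can be sketched in code. This is a minimal illustration rather than the procedure actually used in the study: the function names, the regular-expression tokenizer, and the way the 53 words are spread across sentences (one word per sentence first, then random increments up to four) are all assumptions, since the text states only the constraints.

```python
import random
import re

EXCLUDED = {"the", "and"}  # words the text says were never selected


def eligible_words(sentence):
    """Distinct words longer than two characters, excluding 'the' and 'and'."""
    seen, out = set(), []
    for word in re.findall(r"[A-Za-z']+", sentence):
        key = word.lower()
        if len(word) > 2 and key not in EXCLUDED and key not in seen:
            seen.add(key)
            out.append(word)
    return out


def random_cloze(sentences, n_sentences=21, n_words=53, max_per_sentence=4, seed=None):
    """Pick n_sentences at random and 1..max_per_sentence words in each, n_words in total."""
    rng = random.Random(seed)
    candidates = [s for s in sentences if eligible_words(s)]
    if len(candidates) < n_sentences:
        raise ValueError("not enough sentences with eligible words")
    for _ in range(1000):  # resample if the chosen sentences cannot hold n_words blanks
        chosen = rng.sample(candidates, n_sentences)
        caps = [min(max_per_sentence, len(eligible_words(s))) for s in chosen]
        if sum(caps) < n_words:
            continue
        counts = [1] * n_sentences  # every sentence gets at least one blank
        while sum(counts) < n_words:
            open_slots = [j for j in range(n_sentences) if counts[j] < caps[j]]
            counts[rng.choice(open_slots)] += 1
        return [(s, rng.sample(eligible_words(s), k)) for s, k in zip(chosen, counts)]
    raise ValueError("could not satisfy the word total with these sentences")
```

Calling `random_cloze(sentences, seed=s)` for six different seeds would yield six independent item sets, mirroring the six repetitions described above.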

Machine generated cloze items were selected by using natural language 
processing techniques at the word, sentence, and discourse level. Specifically, the 
entire text was parsed using syntactic, semantic, and discourse parsers [10, 16, 29]. 
These parsers annotated the text with a variety of information, including part 
of speech, word form/lemma, named entities, syntactic dependencies, verbal and 
nominal predicates, argument roles, coreference chains, elementary discourse 
units, and discourse dependencies. Because no labeled data was available, we
applied intuition and linguistic knowledge to develop a relatively simple
heuristic for the selection of sentences and words. Sentences were selected
primarily based on the number of coreference chains they contained (at least three)
and the length of those chains (at least two). These criteria ensured that only
sentences that were well connected to the discourse were preserved. Alternatively,
these criteria can be considered as argument overlap where anaphora, e.g. pronouns,
have been resolved to their referents (cf. [4,31]). Once selected, sentences
were filtered if they consisted of only satellite discourse units, i.e. discourse units 
that did not carry the core meaning of the discourse relationships in which they 
participated. Candidate cloze words for these sentences were selected based on 
whether the word was an argument in a coreference chain, a semantic argument, 
or a syntactic subject or object with a noun or modified noun part of speech. 
Final cloze words were chosen from candidates if they did not belong to the 
1000 most frequent words of English. For example, in the heart and circulatory 
system text, excluded candidate words included “heart,” “middle,” “blood,” and 
“body.” 
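
The selection heuristic described above can be sketched over pre-annotated input. This is an illustrative reading, not the authors' implementation: the `Token` and `Sentence` classes and their field names are hypothetical stand-ins for the parser output, the noun restriction is assumed to apply only to syntactic subjects and objects, and the frequent-word list is supplied by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    pos: str                       # coarse part of speech, e.g. "NOUN" (hypothetical tag set)
    in_coref_chain: bool = False   # token is a mention in some coreference chain
    is_semantic_arg: bool = False  # token fills a semantic argument role
    is_subj_or_obj: bool = False   # token is a syntactic subject or object


@dataclass
class Sentence:
    tokens: list
    # Length of each coreference chain that passes through this sentence.
    chain_lengths: list = field(default_factory=list)
    # False when the sentence consists only of satellite discourse units.
    has_nucleus: bool = True


def is_selected(sentence, min_chains=3, min_length=2):
    """Keep sentences with at least three chains of length two or more and a nucleus unit."""
    long_chains = [n for n in sentence.chain_lengths if n >= min_length]
    return len(long_chains) >= min_chains and sentence.has_nucleus


def cloze_words(sentence, frequent_words):
    """Candidate words by role, minus anything on the frequent-word list."""
    words = []
    for tok in sentence.tokens:
        candidate = (tok.in_coref_chain or tok.is_semantic_arg
                     or (tok.is_subj_or_obj and tok.pos == "NOUN"))
        if candidate and tok.text.lower() not in frequent_words:
            words.append(tok.text)
    return words


def machine_cloze(sentences, frequent_words):
    """Return (sentence, cloze words) pairs for selected sentences with at least one word."""
    items = []
    for sentence in sentences:
        if is_selected(sentence):
            words = cloze_words(sentence, frequent_words)
            if words:
                items.append((sentence, words))
    return items
```

In this sketch, passing `{"heart", "middle", "blood", "body"}` as `frequent_words` would exclude the four example words named above.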


2.4 Procedure 


The experiment was delivered through the web interface of the MoFaCTS system [27]
to AMT participants. Participants completed informed consent and
then took the pre-test. For each participant, 12 concept clusters were randomly
selected from a test bank of 16 concept clusters. Four concepts were randomly 
assigned for pre-test, and eight concepts were randomly assigned for post-test. 
Since each concept had an associated fact and transfer question, the selection 
process yielded eight pre-test items and 16 post-test items. Order of items on 
each test was randomized. After the pre-test, participants read a text on the 
heart and circulatory system for at least 5 min and up to 10 min if they so chose. 
After reading the text, each participant completed one of five conditions:
Do-Nothing, Re-Read, Human Cloze, Machine Cloze, or Random Cloze. Except for
Do-Nothing, each of these conditions lasted from 5 min up to 25 min. Continuing
longer than 5 min was purely by participant choice. The text presented in the
Re-Read condition was the same as the original text. Participants in the three
cloze conditions received items specific to their condition. However, all items were
adaptively sequenced using the MoFaCTS system based on the success history 
of each item and model parameters inferred from pilot experimentation. During 
the cloze conditions, cloze items were presented on the screen and participants 
were asked to fill in the missing word(s) with a 15 s timeout that was reset whenever
the participant typed. After an incorrect response, the correct response was
displayed for 8 s. Upon completing their condition, participants were paid for
the first phase of the experiment. After a 24-hour retention interval, partici- 
pants were contacted via email from MoFaCTS to complete the second phase. 
The second phase consisted of a post-test, comprising items not selected on
the pre-test, presented in random order. Following the post-test, participants 
completed a demographic survey and were paid for the second phase of the 
experiment. 
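
The per-participant test construction described above (12 of 16 concept clusters, four for the pre-test and eight for the post-test, each contributing a fact and a transfer question) can be sketched as follows. The function name and the representation of an item as a `(concept, kind)` pair are assumptions made for illustration.

```python
import random


def assign_items(n_concepts=16, n_selected=12, n_pre=4, seed=None):
    """Randomly split concept clusters into pre-test and post-test item lists."""
    rng = random.Random(seed)
    # sample() returns the chosen concepts in random order, so slicing is a random split.
    selected = rng.sample(range(n_concepts), n_selected)
    pre_concepts, post_concepts = selected[:n_pre], selected[n_pre:]
    # Each concept contributes one fact question and one transfer question.
    pre = [(c, kind) for c in pre_concepts for kind in ("fact", "transfer")]
    post = [(c, kind) for c in post_concepts for kind in ("fact", "transfer")]
    rng.shuffle(pre)   # item order on each test is randomized
    rng.shuffle(post)
    return pre, post
```

Because a concept is assigned to exactly one of the two tests, no pre-test item can cue a post-test item on the same concept.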


3 Results and Discussion 


Although 365 participants attempted the experiment, 13 were excluded for var- 
ious reasons including using a friend’s account, server crashes, and collection 
errors, and 50 were excluded because they did not return for the post-test, i.e. 
were lost to attrition (N = 302). Each condition had approximately the same 
attrition (M = 11.6, SD = 1.64), within the acceptable range for attrition and 
differential attrition for educational research [32]. No outliers were removed or 
transformed. None of the demographic variables collected (age, gender, educa- 
tional attainment, professional knowledge of circulatory system) were signifi- 
cantly related to assigned condition under a chi-square test of independence. 
Table 1 shows the condition sample sizes and means, standard deviations, and 
95% confidence intervals for pre- and post-test proportion correct. 

Table 1. Proportion correct

Group           n    Pre-test                  Post-test
                     M (SD)       95% CI       M (SD)       95% CI
Do-Nothing      62   .46 (.23)    [.41, .52]   .54 (.20)    [.49, .59]
Re-Read         61   .46 (.19)    [.41, .51]   .57 (.23)    [.51, .63]
Random Cloze    58   .46 (.18)    [.41, .51]   .56 (.18)    [.51, .61]
Human Cloze     60   .51 (.18)    [.46, .55]   .61 (.21)    [.56, .67]
Machine Cloze   61   .50 (.20)    [.45, .55]   .67 (.22)    [.61, .73]

Note: CI = confidence interval.

Learning outcomes could not be analyzed as normalized gain scores, i.e.
(post - pre)/(1 - pre), because this value was undefined for some participants
(those with a perfect pre-test score, for whom the denominator is zero).
The choice of analysis between ANOVA on gain scores and ANCOVA on post-test
using pre-test as a covariate was informed by recent guidance suggesting
that when, as in the present study, differences in pre-test between conditions are
substantial, d = .2, and correlation between simple learning gains and pre-test
is large, r(300) = -.5, ANOVA on gain scores is more likely to be biased than
ANCOVA (see Table 5 of [11]). Therefore ANCOVA was adopted for all analyses.
We conducted statistical tests at α = .05 to address our research questions.
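
To make the problem with normalized gain concrete, the short sketch below computes (post - pre)/(1 - pre) and returns `None` when the pre-test proportion is 1, where the denominator is zero. The function name and the use of `None` to mark the undefined case are illustrative choices.

```python
def normalized_gain(pre, post):
    """Return (post - pre) / (1 - pre), or None when pre == 1 (division by zero)."""
    if pre == 1:
        return None
    return (post - pre) / (1 - pre)
```

With an eight-item pre-test, a participant who answers all eight items correctly has pre = 1, so `normalized_gain(8 / 8, 14 / 16)` is undefined, whereas `normalized_gain(4 / 8, 12 / 16)` is 0.5.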

To answer our first research question, whether practice with machine gen- 
erated cloze items promotes reading comprehension, we ran an ANCOVA with 
condition and pre-test proportion correct as predictors and post-test proportion 
correct as the dependent variable. The model controlled for differences in pre-test 
across participants so that differences in post-test can be attributed to condition. 
The ANCOVA revealed a significant main effect of condition, F(4, 296) = 3.04,
p = .02, ηp² = .04, as well as a main effect of pre-test proportion correct,
F(1, 296) = 53.95, p < .001, ηp² = .15. Post hoc comparisons between predicted
marginal means using Tukey’s HSD revealed that the Machine Cloze condition had
significantly higher post-test proportion correct (M = .66, SE = .03) than the
Do-Nothing condition (M = .55, SE = .03), t(296) = 3.21, p = .01, d = .58,
CI95 [.21, .94]. No other pairwise comparisons were significant.
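
For readers unfamiliar with ANCOVA, the test of condition can be viewed as a comparison between two least-squares models: one with the covariate only, and one with the covariate plus condition indicators. The sketch below implements that comparison in plain Python. It is an illustration under stated assumptions, not the analysis script used in the study: the function names are hypothetical, the statistical software and sums-of-squares type used by the authors are not reported, and the sketch returns only the F statistic and its degrees of freedom.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))


def ancova_condition_f(groups, pre, post):
    """F test of condition after adjusting post-test for pre-test (model comparison)."""
    levels = sorted(set(groups))
    # Reduced model: intercept + covariate. Full model adds dummy codes for condition.
    reduced = [[1.0, x] for x in pre]
    full = [[1.0, x] + [1.0 if g == lv else 0.0 for lv in levels[1:]]
            for g, x in zip(groups, pre)]
    sse_reduced, sse_full = sse(reduced, post), sse(full, post)
    df1 = len(levels) - 1
    df2 = len(post) - len(full[0])
    f_stat = ((sse_reduced - sse_full) / df1) / (sse_full / df2)
    return f_stat, df1, df2
```

With five conditions and 302 participants, the degrees of freedom from this sketch are 4 and 296, matching those reported above.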

An additional exploratory analysis was performed to investigate whether 
other variables or interactions omitted from the ANCOVA might qualify or limit 
these results. An exploratory ANCOVA model with condition, text reading time 
(log transformed), pre-test proportion correct, and all interactions as predic- 
tors and post-test proportion correct as the dependent variable was created and 
refined using backward elimination variable selection based on the Akaike infor- 
mation criterion (AIC). The only significant predictors in the exploratory model 
were condition and pre-test proportion correct, which were the same predictors 
in the a priori model. Diagnostic plots revealed no concerning departures from 
normality, heterogeneity, or violations of independence, suggesting the model 
was well-fitted. 

To answer our second research question, whether reading comprehension 
with machine generated cloze items is equivalent to reading comprehension with 
human generated or random cloze items, we ran an ANCOVA with the three 
cloze conditions, pre-test proportion correct, and variables controlling for the 
learning experience within the cloze conditions as predictors and post-test pro- 
portion correct as the dependent variable. The measured variables controlling for 
the learning experience within the cloze conditions included proportion correct
across trials, number of trials, and time. Because time and number of trials were
highly correlated, r(176) = .94, and number of trials (log transformed) was more 
normally distributed than time, trials was included in the model and time was 
not included. Furthermore, because the learning experience necessarily involves 
correctness over time, an interaction between number of trials and proportion 
correct across trials was included. Thus the model controlled for differences in 
pre-test scores, number of trials, proportion correct across trials, and the inter- 
action of number of trials and proportion correct across trials so that differences 
in post-test can be attributed to condition. 

The ANCOVA revealed a significant main effect of condition, F(2, 171) =
7.89, p < .001, ηp² = .08, a main effect of pre-test proportion correct, F(1, 171) =
5.78, p = .02, ηp² = .03, and a main effect of number of trials, F(1, 171) = 9.80,
p = .002, ηp² = .05. A main effect of proportion correct across trials was not
significant, F(1, 171) = 1.57, p = .21, but the interaction of proportion correct
across trials and the number of trials was significant, F(1, 171) = 10.27,
p = .002, ηp² = .06. Examination of the interaction slope revealed that participants
with low proportion correct across a high number of trials fared poorly
on post-test proportion correct. Note that while only the main effect of condition
was relevant to our hypothesis, the effects of condition, number of trials,
and the interaction of the number of trials and proportion correct across trials
are statistically significant with Bonferroni adjusted alpha levels of .01 per test
(α = .05/5). Post hoc comparisons between predicted marginal means using
Tukey’s HSD revealed that the Machine Cloze condition had significantly higher
post-test proportion correct (M = .66, SE = .02) than the Human Cloze condition
(M = .58, SE = .02), t(171) = 2.69, p = .02, d = .49, CI95 [.12, .86] and
significantly higher post-test proportion correct than the Random Cloze condition
(M = .54, SE = .02), t(171) = 3.88, p < .001, d = .71, CI95 [.34, 1.09].
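
The Bonferroni check mentioned above can be reproduced from the reported p-values alone. In this small sketch the dictionary keys are informal labels, and the effect reported as p < .001 is entered at its upper bound of .001, which is an assumption needed to do the comparison numerically.

```python
# p-values as reported for the five effects in the a priori cloze-conditions model.
reported_p = {
    "condition": 0.001,                     # reported as p < .001; upper bound used
    "pre-test": 0.02,
    "number of trials": 0.002,
    "proportion correct": 0.21,
    "trials x proportion correct": 0.002,
}

# Bonferroni adjustment: divide the family-wise alpha by the number of tests.
alpha = 0.05 / len(reported_p)

survivors = [name for name, p in reported_p.items() if p < alpha]
```

Under this adjustment the effects of condition, number of trials, and their interaction with proportion correct remain significant, while the pre-test effect (p = .02) does not.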

An additional exploratory analysis was performed to investigate whether 
other variables or interactions omitted from the ANCOVA might qualify or limit 
these results. An exploratory ANCOVA model with condition, text reading time 
(log transformed), pre-test proportion correct, number of trials, proportion cor- 
rect across trials, and all two-way interactions as predictors and post-test propor- 
tion correct as the dependent variable was created and refined using backward 
elimination variable selection based on the Akaike information criterion (AIC). 
The significant predictors in the exploratory model were identical to the a pri- 
ori model except for the addition of a pre-test proportion correct by number of 
trials interaction, F(1, 170) = 5.50, p = .02, ηp² = .03. Examination of the
interaction slope revealed that participants with low pre-test proportion correct who
experienced a high number of trials fared better on post-test proportion correct
while participants with high pre-test proportion correct who experienced a high
number of trials fared more poorly. Though this interaction is sensible, it should
be treated with caution because it was obtained through variable selection [8]. 
The most useful finding of the exploratory ANCOVA is that it did not alter the 
significant effect of condition or contrasts found in the a priori ANCOVA. Diag- 
nostic plots revealed no concerning departures from normality, heterogeneity, or 
violations of independence, suggesting the model was well-fitted.

To answer our final research question, whether reading comprehension with
machine generated cloze items supports transfer, we re-ran ANCOVAs with test
scores based on the transfer questions alone. An ANCOVA for transfer post-test
proportion correct using condition and transfer pre-test proportion correct as
predictors yielded virtually the same effects and contrasts as the ANCOVA for
all test items. There was a significant main effect of condition, F(4, 296) = 2.59,
p = .04, ηp² = .03, as well as a main effect of pre-test proportion correct,
F(1, 296) = 23.34, p < .001, ηp² = .07. Post hoc comparisons between predicted
marginal means using Tukey’s HSD revealed that Machine Cloze had significantly
higher transfer post-test proportion correct (M = .61, SE = .03) than
the Do-Nothing condition (M = .50, SE = .03), t(296) = 2.82, p = .04, d = .51,
CI95 [.15, .87]. No other pairwise comparisons were statistically significant. An
ANCOVA for transfer post-test proportion correct using the three cloze conditions,
pre-test proportion correct, number of trials, proportion correct across
trials, and the interaction of number of trials and proportion correct as predictors
also yielded virtually the same effects and contrasts as the ANCOVA for
all test items. There was a significant main effect of condition, F(2, 171) = 6.52,
p = .002, ηp² = .07, a main effect of pre-test proportion correct, F(1, 171) = 3.98,
p = .05, ηp² = .02, and a main effect of number of trials, F(1, 171) = 9.13,
p = .008, ηp² = .05. A main effect of proportion correct across trials was not
significant, F(1, 171) = 0.56, p = .46, but the interaction of proportion correct
across trials and the number of trials was significant, F(1, 171) = 7.45,
p = .007, ηp² = .04. Examination of the interaction slope revealed that participants
with low proportion correct across a high number of trials fared poorly on
post-test proportion correct. Post hoc comparisons between predicted marginal
means using Tukey’s HSD revealed that the Machine Cloze had significantly
higher transfer post-test proportion correct (M = .61, SE = .02) than the
Human Cloze condition (M = .52, SE = .03), t(171) = 2.71, p = .02, d = .5,
CI95 [.13, .86] and significantly higher transfer post-test proportion correct than
the Random Cloze condition (M = .49, SE = .03), t(171) = 3.42, p = .002,
d = .63, CI95 [.26, 1.0].
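
The text does not state how the confidence intervals around d were computed. As a rough plausibility check, the sketch below uses one common large-sample approximation to the standard error of Cohen's d; the function name and the choice of this particular approximation are assumptions, and the authors may have used a different method.

```python
import math


def cohens_d_ci(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d from a large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se
```

For the transfer comparison of Machine Cloze (n = 61) against Do-Nothing (n = 62), `cohens_d_ci(0.51, 61, 62)` gives an interval of about .15 to .87, in line with the reported values.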

Our main findings were that the Machine Cloze condition led to superior post- 
test outcomes relative to other conditions, including Human Cloze when learning 
experience variables are controlled for, and that these findings hold both overall 
and for a subset of pre- and post-test questions specifically targeting transfer. 
The causal mechanism behind the advantage for the Machine Cloze condition 
is currently unclear. An examination of the Human Cloze and Machine Cloze 
items revealed 13 sentences in common out of 21. Presumably differences in 
learning between the Human and Machine Cloze conditions are attributable to 
the items not shared and their interactions with the items in common. Recall 
that the primary features for selecting the Machine Cloze sentences were based 
on coreference chains. Sentences with more chains and with longer chains are 
more connected to the discourse by virtue of echoing or extending ideas present 
in other sentences. For the eight items not shared, the sum of Machine Cloze
coreference lengths was 221 and the sum of Human Cloze coreference lengths was
67, meaning that the Machine Cloze items were approximately three times more
connected to the discourse than the Human Cloze items. Whether differences in
coreference chains can explain differences in post-test performance is a matter 
for future research. 


4 Conclusion 


Results from the study suggest that cloze items generated by machine using 
natural language processing techniques are effective for enhancing reading com- 
prehension when delivered by an adaptive practice scheduling system. Because 
such cloze items can be generated automatically, ostensibly for any text, our find- 
ings potentially have broad implications for improving reading comprehension 
in educational settings. An important limitation on these implications, however, 
is that these results were obtained for a single text only and in comparison 
to human-generated items by a single individual. It may be that the natural 
language processing techniques used were particularly suitable to this text and 
would not be as effective for other texts or that these techniques would not fare 
as well against items generated by a domain expert. Two important targets for 
future research are to replicate this finding with other texts in other domains 
and to better understand the properties of the machine generated cloze items 
that made them more effective than human generated cloze items. 